The Foundations of Thread-level Parallelism in the SuperMatrix Runtime System
Abstract
In this paper, we describe the interface and implementation of the SuperMatrix runtime system. SuperMatrix exploits parallelism from matrix computations by mapping a linear algebra algorithm to a directed acyclic graph (DAG). We give detailed descriptions of how to dynamically construct a DAG in which tasks consisting of matrix operations represent the nodes and data dependencies between tasks represent the edges of the graph. We present the algorithm that, given a DAG as input, dispatches and schedules tasks onto threads. We discuss the different scheduling heuristics and optimizations implemented as part of SuperMatrix, demonstrating the flexibility and portability that result from this separation of concerns. Using this flexible framework, we compare several scheduling algorithms, such as work stealing, that optimize for either load balancing or data locality, and we demonstrate that a relatively simple, single-queue implementation provides exceptional performance while also allowing the widest flexibility for further enhancements. We present performance results from a sixteen-core machine.
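
To make the two mechanisms described in the abstract concrete, the following minimal sketch (in C with POSIX threads, not the actual SuperMatrix/FLAME API) illustrates one way they could fit together: a DAG is built incrementally by recording, for each matrix block, the last task that wrote it, and a single mutex-protected ready queue dispatches tasks to worker threads, releasing successor tasks as their dependencies are satisfied. All names here (`task_t`, `dag_add_task`, `worker`, the capacity constants) are hypothetical, and anti (write-after-read) dependences are omitted for brevity.

```c
/* Sketch of DAG construction plus single-queue dispatch (not SuperMatrix code). */
#include <pthread.h>
#include <stdlib.h>

#define MAX_TASKS  256
#define MAX_DEPS   16
#define NUM_BLOCKS 64   /* blocks of the partitioned matrix (assumed) */

typedef struct task {
    void (*func)(void *);          /* the block operation to execute  */
    void *arg;
    int  unmet_deps;               /* dependencies still outstanding  */
    struct task *succ[MAX_DEPS];   /* successors (outgoing DAG edges) */
    int  nsucc;
} task_t;

/* ---- DAG construction: record the last writer of each block ------------ */
static task_t *last_writer[NUM_BLOCKS];

/* Register a task that reads blocks reads[0..nreads-1] and writes write_blk.
 * Flow (RAW) and output (WAW) dependences are added from prior writers. */
static void dag_add_task(task_t *t, const int *reads, int nreads, int write_blk)
{
    t->unmet_deps = 0;
    t->nsucc = 0;
    for (int i = 0; i < nreads; i++) {
        task_t *w = last_writer[reads[i]];
        if (w) { w->succ[w->nsucc++] = t; t->unmet_deps++; }
    }
    task_t *w = last_writer[write_blk];
    if (w) { w->succ[w->nsucc++] = t; t->unmet_deps++; }
    last_writer[write_blk] = t;
}

/* ---- Dispatch: a single mutex-protected ready queue -------------------- */
static task_t         *ready[MAX_TASKS];
static int             head, tail, tasks_left;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void enqueue_ready(task_t *t)       /* caller must hold `lock` */
{
    ready[tail++ % MAX_TASKS] = t;
    pthread_cond_signal(&cond);
}

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && tasks_left > 0)
            pthread_cond_wait(&cond, &lock);
        if (head == tail) {                 /* queue empty and DAG done */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        task_t *t = ready[head++ % MAX_TASKS];
        pthread_mutex_unlock(&lock);

        t->func(t->arg);                    /* execute the block operation */

        pthread_mutex_lock(&lock);
        tasks_left--;
        for (int i = 0; i < t->nsucc; i++)  /* release ready successors */
            if (--t->succ[i]->unmet_deps == 0)
                enqueue_ready(t->succ[i]);
        if (tasks_left == 0)
            pthread_cond_broadcast(&cond);  /* wake idle workers to exit */
        pthread_mutex_unlock(&lock);
    }
}
```

A driver would set `tasks_left` to the number of registered tasks, enqueue every task whose `unmet_deps` count is zero, and spawn one `worker` per core. The load-balancing and data-locality heuristics compared in the paper, such as work stealing, would replace this single shared queue with per-thread queues while leaving the DAG construction untouched.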
Similar Papers
Efficient Runtime Thread Management for the Nano-Threads Programming Model
The nano-threads programming model was proposed to effectively integrate multiprogramming on shared-memory multiprocessors, with the exploitation of fine-grain parallelism from standard applications. A prerequisite for the applicability of the nano-threads programming model is the ability of the runtime environment to manage parallelism at any level of granularity with minimal overheads. In thi...
Exploiting fine-grain thread parallelism on multicore architectures
In this work we present a runtime threading system which provides an efficient substrate for fine-grain parallelism, suitable for deployment in multicore platforms. Its architecture encompasses a number of optimizations that make it particularly effective in managing a large number of threads and with low overheads. The runtime system has been integrated into an OpenMP implementation to allow f...
Compiling Data-parallel Programs to a Distributed Runtime Environment with Thread Isomigration
Traditionally, the compilation of data-parallel languages is targeted to low-level runtime environments: abstract processors are mapped onto static system processes, which directly address the low-level IPC library. Alternatively, we propose to map each HPF abstract processor onto a “lightweight process” (thread) which can be freely migrated between nodes together with the data it manages, unde...
An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization
We pursue the scalable parallel implementation of the factorization of band matrices with medium to large bandwidth targeting SMP and multi-core architectures. Our approach decomposes the computation into a large number of fine-grained operations exposing a higher degree of parallelism. The SuperMatrix run-time system allows an out-of-order scheduling of operations that is transparent to the pr...
Runtime Support for Multigrain and Multiparadigm Parallelism
This paper presents a general methodology for implementing on clusters the runtime support for a two-level dependence-driven thread model, initially targeted to shared-memory multiprocessors. The general idea is to exploit existing programming solutions for these architectures, like Software DSM (SWDSM) and Message Passing Interface. The management of the internal runtime system structures and...